High-Performance Storage Support for Scientific Big Data Applications on the Cloud

Authors

  • Dongfang Zhao
  • Akash Mahakode
  • Sandip Lakshminarasaiah
  • Ioan Raicu
Abstract

This work studies the storage subsystem for scientific big data applications running in the cloud. Although cloud computing has become one of the most popular paradigms for executing data-intensive applications, its storage subsystem has not been optimized for scientific applications. In particular, many scientific applications were originally developed for high-performance computing (HPC) systems: tightly coupled clusters of compute nodes with network-attached storage that allows massively parallel I/O access. These applications, in turn, struggle to leverage cloud platforms, whose design goals differ fundamentally from those of HPC systems. We believe that when executing scientific applications in the cloud, a node-local distributed storage architecture is a key approach to overcoming the challenges posed by the storage subsystem. We analyze and evaluate four representative file systems (S3FS, HDFS, Ceph, and FusionFS) on multiple platforms (Kodiak cluster, Amazon EC2) with a variety of benchmarks to explore how well these storage systems handle metadata-intensive, write-intensive, and read-intensive workloads. Moreover, we elaborate on the design and implementation of FusionFS, which employs a scalable approach to managing both metadata and data, in addition to its unique features: cooperative caching, dynamic compression, GPU-accelerated data redundancy, lightweight provenance, and parallel serialization.

Author affiliations

  • Dongfang Zhao: University of Washington, Seattle, WA, USA
  • Akash Mahakode: Illinois Institute of Technology, Chicago, IL, USA
  • Sandip Lakshminarasaiah: Illinois Institute of Technology, Chicago, IL, USA
  • Ioan Raicu: Illinois Institute of Technology, Chicago, IL, USA


Similar resources

Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming

The objective of this study is to verify the importance of the capabilities of cloud computing services in managing and analyzing big data in business organizations because the rapid development in the use of information technology in general and network technology in particular, has led to the trend of many organizations to make their applications available for use via electronic platforms hos...


Data Replication-Based Scheduling in Cloud Computing Environment

High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grid, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...


A Cloud Framework for Big Data Analytics Workflows on Azure

Since digital data repositories are more and more massive and distributed, we need smart data analysis techniques and scalable architectures to extract useful information from them in reduced time. Cloud computing infrastructures offer an effective support for addressing both the computational and data storage needs of big data mining applications. In fact, complex data mining tasks involve dat...


Communication-Aware Traffic Stream Optimization for Virtual Machine Placement in Cloud Datacenters with VL2 Topology

By pervasiveness of cloud computing, a colossal amount of applications from gigantic organizations increasingly tend to rely on cloud services. These demands caused a great number of applications in form of couple of virtual machines (VMs) requests to be executed on data centers’ servers. Some of applications are as big as not possible to be processed upon a single VM. Also, there exists severa...


A Cost-Effective Strategy for Storing Scientific Datasets with Multiple Service Providers in the Cloud

Cloud computing provides scientists a platform that can deploy computation and data intensive applications without infrastructure investment. With excessive cloud resources and a decision support system, large generated datasets can be flexibly 1) stored locally in the current cloud, 2) deleted and regenerated whenever reused or 3) transferred to cheaper cloud service for storage. However, due ...




Publication date: 2016